Sydney House Prices


House prices in Sydney have attracted a great deal of attention in Australia and globally, specifically for being extraordinarily high. As a resident of Sydney, I was interested in seeing the relative prices across the different suburbs around where I live. I wanted a way to visualise these geospatial relationships myself.

A choropleth map (from Greek χῶρος choros 'area/region' and πλῆθος plethos 'multitude') is a type of thematic map in which a set of pre-defined areas is colored or patterned in proportion to a statistical variable that represents an aggregate summary of a geographic characteristic within each area, such as population density or per-capita income.

Gathering Data

The data I used can be found here. I use a YAML file for configuration parameters, mainly for more readable code and easier parameter tweaking in the future. The file looks similar to a dictionary, with key and value pairs.

YAML (a recursive acronym for "YAML Ain't Markup Language") is a human-readable data-serialization language. It is commonly used for configuration files and in applications where data is being stored or transmitted.
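As a minimal sketch of how such a configuration might be read, the snippet below parses a small YAML document with PyYAML's `safe_load`. The keys and values (`data.path`, `filters.min_price`) are hypothetical and only illustrate the dictionary-like structure; they are not the configuration used for this post.

```python
import yaml  # PyYAML, a third-party package

# Hypothetical configuration; keys and values are illustrative only.
CONFIG_TEXT = """
data:
  path: data/sydney_house_prices.csv
  date_column: Date
filters:
  min_price: 10000
"""

# safe_load parses plain YAML into Python dicts, lists and scalars
# without constructing arbitrary Python objects.
config = yaml.safe_load(CONFIG_TEXT)

print(config["data"]["path"])
print(config["filters"]["min_price"])
```

In practice the text would come from a file opened with `open(...)`, and the resulting dictionary is passed around instead of hard-coded values.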

Inspecting and Cleaning the Data

As with all important data science tasks, we want to inspect our data and gather some elementary information about the features, labels and variable types, amongst other things.


There are 199,504 entries, spanning from 2011-04-16 to 2019-06-19. Quite a large number of data points here. There are 8 columns (zero-indexed) of various types.

The sellPrice column should be converted to floats, as selling prices are continuous variables, and the propType column to a categorical type, as property types are... categorical. This will make our analysis methods more informative.
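One way this conversion could look in pandas is sketched below. The column names come from the text, but the rows are made-up stand-ins for the real dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset; the values are made up.
df = pd.DataFrame(
    {
        "sellPrice": [1210000, 2250000, 850000],
        "propType": ["house", "house", "townhouse"],
    }
)

# Selling prices are continuous, so store them as floats.
df["sellPrice"] = df["sellPrice"].astype("float64")

# Property types take a small set of repeated values, so store them as a category.
df["propType"] = df["propType"].astype("category")

print(df.dtypes)
```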


Great! Our values now are of a more appropriate type. Now to continue inspecting the dataframe.

There are a few columns that are redundant for our analyses. We can remove the id and postalCode columns.
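A minimal sketch of dropping those two columns, again on made-up rows. The `suburb` column name is an assumption; `id`, `postalCode` and `sellPrice` are taken from the text.

```python
import pandas as pd

# Toy frame; values are made up and "suburb" is an assumed column name.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "postalCode": [2000, 2001],
        "suburb": ["Suburb A", "Suburb B"],
        "sellPrice": [900000.0, 1400000.0],
    }
)

# drop returns a new frame without the listed columns.
df = df.drop(columns=["id", "postalCode"])

print(df.columns.tolist())
```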

To quickly get an overview of our data and its statistical properties, we can use the describe method on the dataframe. We transpose the result to make viewing it easier.
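For illustration, here is roughly what that call looks like on a toy frame. The `bed` column name is an assumption and the values are made up; after transposing, each original column becomes a row and each statistic a column.

```python
import pandas as pd

# Toy numeric columns; "bed" is an assumed column name, values are made up.
df = pd.DataFrame(
    {
        "sellPrice": [850000.0, 1210000.0, 2250000.0],
        "bed": [2, 3, 4],
    }
)

# describe() summarises each numeric column; .T flips the table so that
# every column of df becomes one row of summary statistics.
summary = df.describe().T

print(summary)
```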

Immediately some interesting properties stand out.

There is some work to do dealing with the outliers in this dataset it seems.

We can begin by removing the max value in our dataset and then check how this has impacted our summary statistics.

Supposedly this property was sold in 2010, in Zetland, with 99 bedrooms and 41 car spaces. Even if this were a big property development, this price does not make sense. For further context, the GDP of Estonia is around the same as this outlier.

Let's drop this value.
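A sketch of dropping the single largest sale and comparing the standard deviation before and after. The prices are made up, with one deliberately absurd value standing in for the outlier.

```python
import pandas as pd

# Toy prices; the last value is a made-up stand-in for the extreme outlier.
df = pd.DataFrame(
    {"sellPrice": [900000.0, 1130000.0, 1400000.0, 2000000000.0]}
)

std_before = df["sellPrice"].std()

# idxmax returns the index label of the largest price; drop removes that row.
df = df.drop(index=df["sellPrice"].idxmax())

std_after = df["sellPrice"].std()

print(f"std before: {std_before:,.0f}")
print(f"std after:  {std_after:,.0f}")
```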

Woah! Our standard deviation has dropped significantly, as expected, by around 5.5m. This is important, as any inferences or analyses would have been quite off the mark if we had included our Estonia-priced property.

Let us continue by addressing the lower range of our dataset. I believe a sensible lower bound is to keep only properties that sold for more than $10,000.
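The lower cut-off can be expressed as a boolean mask. The sketch below uses made-up prices; the threshold is the $10,000 from the text, and the comparison is strict, so a sale of exactly $10,000 is excluded.

```python
import pandas as pd

MIN_PRICE = 10_000  # threshold taken from the text

# Toy prices, including a few implausibly low made-up values.
df = pd.DataFrame(
    {"sellPrice": [1.0, 9500.0, 10000.0, 650000.0, 1200000.0]}
)

# Keep only rows whose price is strictly greater than the threshold.
filtered = df[df["sellPrice"] > MIN_PRICE]

print(filtered)
```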

After some preliminary and elementary data preprocessing we can now explore our data and find answers to some interesting questions. Maybe we can begin with:

Which 10 suburbs sold the most properties?


Interesting: Castle Hill, located 30 kilometres north-west of the Sydney central business district and 9.5 kilometres north of Parramatta, tops the list. It is within the Hills District region, split between the local government areas of The Hills Shire and Hornsby Shire. Castle Hill residents have a personal income that is 18.9% greater than the median national income, according to the 2016 Census. This suggests that Castle Hill may be of interest to property analysts.
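A ranking like this can be produced by counting rows per suburb. The sketch below assumes one row per sale and a column named `suburb` (an assumed name); the suburb labels are placeholders, not real figures.

```python
import pandas as pd

# One row per sale; suburb labels are placeholders and "suburb" is an assumed name.
sales = pd.DataFrame(
    {"suburb": ["Suburb A", "Suburb B", "Suburb A", "Suburb C", "Suburb A", "Suburb B"]}
)

# value_counts sorts by frequency, most common first; head(10) keeps the top ten.
top_suburbs = sales["suburb"].value_counts().head(10)

print(top_suburbs)
```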

Another interesting metric, which will later be used for the choropleth map, is the median house price for each suburb (a statistic that is far less skewed by outliers than the mean).
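To illustrate why the median is the safer summary here, the sketch below computes both the median and the mean per suburb on made-up sales, where one suburb contains a single extreme price. The column names `suburb` and `medianPrice` are assumptions.

```python
import pandas as pd

# Made-up sales; "Suburb A" contains one extreme price.
sales = pd.DataFrame(
    {
        "suburb": ["Suburb A"] * 4 + ["Suburb B"] * 2,
        "sellPrice": [
            1000000.0, 1100000.0, 1200000.0, 50000000.0,
            1300000.0, 1500000.0,
        ],
    }
)

by_suburb = sales.groupby("suburb")["sellPrice"]

# One row per suburb with its median price.
median_prices = by_suburb.median().reset_index(name="medianPrice")

# The mean, for comparison, is dragged up by the extreme sale.
mean_prices = by_suburb.mean()

print(median_prices)
print(mean_prices)
```

On this toy data the extreme sale pulls the mean for "Suburb A" above ten million, while the median stays close to the typical prices.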

Out of interest, the large outlier, the 20.7b property, was located in Zetland, where the median house price is 1.130m. More evidence that this value was bonkers.

What are the suburbs with the highest median prices in Sydney?

Unsurprisingly, Point Piper has the highest median selling price. Next on the list is Collaroy Beach, another coastal suburb, this one in the Northern Beaches. It seems the old adage that coastal properties house the elite may be correct, judging by these high selling prices.
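Given a per-suburb series of medians, the highest values can be pulled out with `nlargest`. The labels and numbers below are placeholders, not results from the dataset.

```python
import pandas as pd

# Placeholder medians per suburb; the numbers are made up.
median_prices = pd.Series(
    {"Suburb A": 1150000.0, "Suburb B": 4200000.0, "Suburb C": 2300000.0},
    name="medianPrice",
)

# nlargest returns the n highest values, sorted from highest to lowest.
top_medians = median_prices.nlargest(2)

print(top_medians)
```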

The next step to get a better feel for our data is to visualise some relationships.

Visualisations

Our data has a temporal dimension: each sale has a timestamp attached to it. An interesting insight may be which months were the most popular for property sales around Sydney.

As we can see, January is the month in which the fewest properties were sold. March was the month with the most sales.
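Counting sales per calendar month can be sketched as follows. The column name `Date` is an assumption, and the dates are made up (chosen so that the toy result has the same shape as the finding above).

```python
import pandas as pd

# Made-up sale dates; "Date" is an assumed column name.
sales = pd.DataFrame(
    {
        "Date": [
            "2018-01-15", "2018-03-02", "2018-03-20",
            "2019-03-11", "2018-06-05", "2019-06-19",
        ]
    }
)

# Parse the strings into timestamps so the .dt accessor is available.
sales["Date"] = pd.to_datetime(sales["Date"])

# Count sales per calendar month (1 = January), pooled across years.
sales_per_month = sales["Date"].dt.month.value_counts().sort_index()

busiest_month = sales_per_month.idxmax()
quietest_month = sales_per_month.idxmin()

print(sales_per_month)
```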

Distributions

We can also check the distributions of our variables.

Boxplot

A boxplot is a useful way to compare continuous values across different categories. Here we can analyse differences between the property types in Sydney.
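The box in a boxplot is drawn from the quartiles of each group, so the same comparison can be checked numerically. This is a sketch on made-up prices, not the plotting code used for the figure.

```python
import pandas as pd

# Made-up prices for two property types.
sales = pd.DataFrame(
    {
        "propType": ["house"] * 4 + ["unit"] * 3,
        "sellPrice": [
            800000.0, 1000000.0, 1200000.0, 1400000.0,
            500000.0, 600000.0, 700000.0,
        ],
    }
)

# Lower quartile, median and upper quartile per property type:
# these are the three lines that make up each box.
box_stats = (
    sales.groupby("propType")["sellPrice"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(box_stats)
```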

Creating the Choropleth Map

To create a choropleth map using plotly we need a couple of things:

We imported a useful helper function that grabs the relevant JSON from the internet and saves it as a geopandas dataframe. This will be useful when merging the data.

We can see that the nsw_loca_2 column is the most obvious key to match on. The first column holds the geometry parameters that plotly will use to plot our map. However, one issue still exists: we need to make sure the strings in nsw_loca_2 are formatted in the same way as in the median statistics dataframe. From a quick inspection, that dataframe has suburbs in a proper noun format. We will change the geopandas dataframe to reflect this.
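A minimal sketch of that string clean-up, using a plain pandas frame in place of the geopandas one (the string methods are the same). It assumes the raw names are upper case, which is an assumption about the source file rather than something confirmed here.

```python
import pandas as pd

# Plain pandas stand-in for the geopandas frame; upper-case names are an assumption.
geo_df = pd.DataFrame({"nsw_loca_2": ["CASTLE HILL", " POINT PIPER", "ZETLAND"]})

# Strip stray whitespace, then capitalise the first letter of each word.
geo_df["nsw_loca_2"] = geo_df["nsw_loca_2"].str.strip().str.title()

print(geo_df["nsw_loca_2"].tolist())
```

Note that `str.title` capitalises every word, so names with unusual casing would need separate handling.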

Awesome! Now we can merge our two dataframes. We want to do an inner merge, which keeps only the rows whose key appears in both dataframes.
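The merge itself might look like the sketch below. Both frames are toy stand-ins: the geometry strings and median values are placeholders, and the `suburb` and `medianPrice` column names are assumptions.

```python
import pandas as pd

# Stand-in for the geopandas frame; geometry values are placeholders.
geo_df = pd.DataFrame(
    {
        "nsw_loca_2": ["Castle Hill", "Zetland", "Point Piper"],
        "geometry": ["<polygon>", "<polygon>", "<polygon>"],
    }
)

# Stand-in for the median statistics; the values are placeholders, not real medians.
median_prices = pd.DataFrame(
    {
        "suburb": ["Castle Hill", "Zetland", "Bondi"],
        "medianPrice": [1.0, 2.0, 3.0],
    }
)

# how="inner" keeps only suburbs that appear in both frames.
merged = geo_df.merge(
    median_prices, left_on="nsw_loca_2", right_on="suburb", how="inner"
)

print(merged)
```

Here "Point Piper" and "Bondi" each appear in only one frame, so neither survives the merge.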